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Abstract 

We examine the efficiency of clustering a set of points, when the encompassing metric space 
may be preprocessed in advance. In computational problems of this genre, there is a first stage 
of preprocessing, whose input is a collection of points M; the next stage receives as input a 
query set Q C M, and should report a clustering of Q according to some objective, such as 
1-median, in which case the answer is a point a G M minimizing X^eQ °^/(a,<?). 

We design fast algorithms that approximately solve such problems under standard clustering 
objectives like p-center and p-median, when the metric M has low doubling dimension. By 
leveraging the preprocessing stage, our algorithms achieve query time that is near-linear in the 
query size n = \Q\, and is (almost) independent of the total number of points m = \M\. 
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1 Introduction 



Clustering is a ubiquitous computational task of prime importance in numerous applications and 
domains, including machine learning, image processing, and bioinformatics. While the clustering 
problem has several variations, it often falls within the following framework of metric clustering: 
given a set of points Q in a metric space (M, d), choose a set of centers C (in that same metric 
space) so as to minimize some objective function of Q and the centers C. For example, in the 
p-median problem, the goal is to find a set of p centers CCM that minimizes the objective 

med(Q,C) := E ff6 Q<*(g, C), 

where we define d(q,C) := min ce c d(q, c). 

Our focus here is on understanding whether an initial preprocessing stage can speed up the 
process of (metric) clustering. Concretely, we are interested in algorithms for efficient clustering 
of Q when the metric M can be preprocessed in advance. Throughout, we denote the number of 
center candidates by m = \M\, and the number of query points by n = \Q\. The goal is to answer 
queries with time close to linear in n and (almost) independent of m. To our knowledge, no previous 
research on (metric) clustering problems has addressed the issue of preprocessing. Past work has 
largely focused on the offline problem, where the entire input is given at once, either because M is 
implicit (e.g., a Euclidean space) or because M is given together with Q (called discrete centers). 
Other past work studied the online version, where points arrive one by one (the data-stream model). 

Clustering with preprocessing can model, for example, the following scenario. Consider a huge 
corpus of documents (M) with distances between the documents (d) defining a metric space. Given 
a relatively small subset of the documents (Q), we may wish to quickly cluster them using centers 
from the corpus. Since preprocessing needs to be done only once, it has the benefit that even a 
huge corpus can be processed, by pooling together many machines or by running it for several days. 

The first problem we consider is the p-median problem defined above. A second problem of 
interest, called p- center, is to find a set of p centers CCM that minimizes the objective 



cntr (Q,C) := max qeQ {d(q,C)}. 

Observe that when n = 1 and p = 1, both the p-median and p-center problems receive a sin- 
gle input point q and seek the point of M that is closest to q, which is precisely the famous 
nearest neighbor search (NNS) problem. Even for this special case of NNS (i.e., n = p = 1), 
Krauthgamer and Lee |KL04a| have shown that achieving approximation factor better than | in a 
general metric requires query time that depends on the doubling dimension of the metric, regard- 
less of the preprocessing. (Throughout, we denote the doubling dimension by ddim = ddim(M); 



see Section L4 for a formal definition.) It thus follows that for general n and p, one must con- 
sider metrics M whose doubling dimension is bounded, and we indeed assume as such. We also 
assume that computing the distance between two points takes O(l) time. This whole approach 
follows an established line of research that covers a host of problems including nearest neighbor 
search | KL04b , HA 106 . BKL06 , |CG06 | as well as routing, distance estimation, the traveling sales- 



man problem and classification see e.g. [|AGGM06| , [KRX08|1 , [|KSW09| , gEQ7) , jTaT03| , pGK12| , and 
I BLLOSI , IGKKICH . 
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1.1 Results 



We provide the first clustering algorithms that leverage a preprocessing stage to obtain improved 
query time. Specifically, we design algorithms that compute (1+ ^-approximation for thep-median 
and p-center problems; the precise time and space bounds are presented in Table |l[ Observe that 
the query time is near-linear in n and is (almost) independent of \M\, assuming the other parameters 
(s , p and ddim) are small. For sake of simplicity, we let our results depend on the aspect ratio 
of M, denoted A = A(M). Such bounds can usually be refined, replacing e.g. log A terms with 
logn, by adapting our algorithms using known techniques and data structures, but it would clutter 
the presentation of our main ideas. Interestingly, we use essentially the same data structure for all 
problems solved. All our space bounds are expressed in terms of machine words, which as usual 
can accommodate a pointer to a data point or a single distance value. 



Problem 


Preprocessing time 


Space 


Query Time 


1-median 
Theorem 4.1 


2 0(ddim) mlogAlog logA 


20(ddim) ?7i 


0{n log n + 2°( ddim ) log A + e -0(«Wim)) 


p-median 
Theorem 7.1 


2 0(ddim) mlogAlog logA 


20(ddim) ?71 


0(n log n + e -^(pddim) ^ log n ^o( P ) 

+ £ -0(ddim) ( p log n )0(l) . log i og log A ) 


p-center 
Theorem 6.1 


2 0(ddim) mlogAlog logA 


20(ddim) m 


0{n log n + p log log log A 

_|_pP+l . £ -0(p-ddim)^ 



Table 1: Our algorithms for (1 + e)-approximation of clustering problems. 

We point two possible extensions of our results. First, one may ask about updates to M, i.e., 
inserting and deleting points. Our data structure is similar to previous work on NNS, and thus we 
expect the methods known there (see e.g. [ KL04b| ) to apply also in our case, although we did not 



check all the details. Second, we assume throughout that Q C M. One may remove this restriction, 
possibly adapting the definition of ddim and A to refer to M U Q. Again, we have not checked 
the details, but we expect this is possible by roughly applying the procedure of inserting Q to M 
before executing the query Q, except that now we cannot use points of Q \ M as centers. 

Our bounds for clustering with preprocessing are in a new model that was not studied before, and 
thus cannot be compared directly with previous work. But of course, all of our results immediately 
imply also algorithms for the respective offline problems, where the input includes both Q and 
M. Since our preprocessing time is near linear in m, these are pretty efficient as well. Even 
for the offline problems, our results are new, as we are not aware of previous work on clustering 
(p-median and p-center) in metric spaces of bounded doubling dimension. Notice that a naive 
algorithm, which exhaustively tries all possible sets of centers (with no preprocessing) finds an 
optimal solution but takes runtime (^)np, which is significantly higher even for p = 1. Another 
possible comparison is with the respective Euclidean problems; this is only for the sake of analogy 



and is discussed in Section 1.3. We also point out that our 1-median algorithm is deterministic, 
while previous algorithms achieving (l + e)-approximation for 1-median, even in Euclidean metrics, 
are randomized gnd99|, PHI02| , |KSS10|| . 
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1.2 Techniques 



Our algorithms build on several techniques from prior work. The common algorithmic paradigm for 
the NNS problem in metrics with low doubling dimension | KL04b| , HM06| , [BKL06| , pG06 ] (which in 
our context is just the special case p = n = 1), is to look for an answer (center point) by restricting 
attention to a sequence of search balls, whose radii are decreasing, usually by a constant factor. 
When the ball's radius becomes small enough, the algorithms revert to exhaustively trying a small 
set of candidates inside the search ball, with the property that at least one candidate in the set 
must be a good enough approximation to an optimal answer. Such a set of candidates is sometimes 



called a centroid set HM04 ], We follow this paradigm, but extend and modify it for our needs. 



We further borrow a technique of constructing a coreset [BHI02], which essentially assigns points 
in Q to a small set of "representatives" R, so that solving the clustering problem on the weighted 
set R provides a good approximation for clustering Q. The weight of a representative r € R is 
simply the number of query points q £ Q assigned to it. In contrast to previous work on coresets 
and on centroid sets, we have the leverage of preprocessing M, and our challenge is to quickly 
construct such sets for Q during query time. 

We also devise a new technique (new at least in the context of our clustering problems) of 
"projecting" the data structure constructed for M (during preprocessing) onto the query set Q C M . 
While a data structure for Q can be constructed from scratch in time 2°( ddim )nlogn, the projection 
can be constructed even faster, in time 0{n log n). But even more importantly, the projected data 
structure inherently provides hooks into the larger set M, and these hooks are crucial for our goal 
of locating centers in M, which is (generally) a much richer point set than Q. 



1.3 Related Work 

Metrics with bounded doubling dimension are known to generalize Euclidean metrics of fixed- 
dimension. Below we briefly mention known algorithms achieving (1 + e)-approximation for the 
p- median and p-center problems in Euclidean spaces of fixed dimension D. These are only intended 
to be a crude analogy to our results, possibly providing yet another perspective. Often, different 
tradeoffs are possible between the number of centers p and the dimension D. We do not discuss 
approximation algorithms for general metrics, as these do not achieve (1 + ^-approximation. 

We start with the p-median problem. Arora, Raghavan and Rao |ARR9g| 1 were the first to 
obtain (1 + ^-approximation, via a divide-and-conquer approach based on quadtrees and dynamic 



programming. This approach was later improved by Kolliopoulos and Rao [KR07] and by Badoiu, 
Har-Peled, and Indyk [BHI02]. Har-Peled and Mazumdar ]HM04 ] added another technique of find- 
ing coresets, and obtained running time 0(n + pP+ 2 e ~( 2D + l )v j g p+1 nlog p -). Kumar, Sabharwal, 
and Sen [ KSS10| ] showed a different approach, based on finding centroid sets, that runs in time 
2(p/ £ ) 1 n. These approaches were later combined by Chen JCheOE ], who obtains improved runtime 
when the dimension D is large. 

For the p-center problem, Agarwal and Procopiuc [ AP02| ] obtain (1 + ^-approximation in time 
0{n\ogp) + (p/e)°( Dpl Badoiu, Har-Peled, and Indyk | BHI02| show an algorithm that runs 

in time p°(p/ £ ^Dn. 
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1.4 Preliminaries 

Let (M,d) be a finite metric space. The doubling dimension of M, denoted ddim = ddim(M), is 
the smallest k > such that every ball (in M) can be covered by 2 k balls of half the radius. We 
denote the diameter of the metric by diam(M) := m.ax x ^ y ^M d(x, y), and its aspect ratio (or spread) 
by A(M) := ^'ff x ff& . 

Let r > 0. An r-net of a point set 5 C M is a subset iV C S satisfying: (a) packing property: 
for all x, y £ iV we have d(x, y) > r; and (b) covering property: for all x G S we have d(x, N) < r. 
J] Such a net always exists, and can be constructed greedily by considering the points one by one 
in an arbitrary order. 



2 Our Data Structure 

The Net Hierarchy. Our data structure is based on a lot of previous work on algorithms and 



data structures for doubling metrics, in particular for nearest neighbor search | KL04b , HM06 , 



BKL06, CG06]. But despite the overall similarity, some technical details differ slightly from each of 
those papers. Let itop := flog 2 diam(Af )] , and assume for simplicity that the minimum interpoint 
distance in M is m.m. x ^ y ^M d{x,y) = 1 (otherwise we need to introduce ibot as its logarithm). 

Let Yo = M, and for i = 1, . . . , i top let Yi be a 2*-net of Note that it is not necessarily a 2 l - 
net of M, but it does cover M indirectly via the nets at lower levels. We sometimes refer to Y{ as the 
level i net. By definition, Yj C so when we refer to y G Yi we mean the copy of y which is in Yi. 
These nets form a natural hierarchy, with Yo being on the bottom, and a singleton YJ top = {ytop} at 
the top of the hierarchy. This hierarchy may be represented by a directed acyclic graph Gm, whose 
vertex set is the union of all the nets Y, (so a point y G M may have multiple copies in this graph) , 
and with an arc from every yi G Yi to every yi-i £ Y";_i for which d(yi,y^i) < 2*. We prefer not 
to maintain the graph Gm explicitly; instead, our data structure has two main components, a tree 
T and a collection of c-lists, which are defined below. 

The Tree T. The hierarchy is represented by a tree that is defined as follows. First construct 
Gm as explained above. Next, every node in Gm keeps only one of its incoming arcs that is chosen 
arbitrarily except for giving higher priority to the arc coming from another copy of the same point 
of M (if it exists). The surviving arcs define (when ignoring the edge orientations) a tree, denoted 
T = Tm, which is rooted at ytop- Because of the prioritization rule, whenever a point y G Yi has 
only one child in the tree T, this child must correspond to the same point y but in Y^_i. Thus, 
every non-branching path in T consists of copies of the same point in M in consecutive nets. By 
contracting each such path while recording the range of nets in which it participates, we can store 
the tree T more compactly, using only 0{m) space (recall the tree has m leaves). However, as 
explained a bit later, we actually employ a more limited compaction, that results with a weaker 
space bound 2°( ddim ) m. 

We supplement T with a data structure that supports constant-time lowest common ancestor 
(LCA) queries using an additional 2°( ddim W words | HT84[| (see also [BFOC] for a simplified ver- 



sion). For the 1-center and p-center algorithms, we supplement T also with a data structure for 
weighted level ancestor queries | FM96| , KL07 |, which locate an ancestor of q G M at level i (i.e., 



1 Another common definition has a strict inequality in condition (a) rather than in (b). Our analysis can be 
adapted to this definition by changing constants. 
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in Yi) in 0(logloglog A) time. The preprocessing for the weighted level ancestor queries requires 
2°( ddim ) m logloglog A time. 

The c-Lists. For some constant c > 1 that will be determined later, we maintain for every net 
point y £ Yi a so-called c-list 

Ly,i, c ■■= {z £ Fj_i : d(y, z) < c ■ 2 1 }. 

The c- lists allow us to traverse the ball of radius c2* in the next level of the hierarchy. If c = 1, 
this list can be viewed as the set of arcs leaving y € Yi in Gm- When c > 1, these lists can be used 
(via straightforward filtering) to recover the arcs of Gm- Since c is an absolute constant, the size 
of each c-list is at most c °( ddim ) < 2°( ddim ) (see e.g. ]GKL03| , [KL04b|| ). We do not store the c-list 



explicitly for every point in every net, as this might require too much space. We say that a c-list 
of a point y € Yi is trivial if it has size 1, in which case the only point in this list must be the copy 
of y in We store only nontrivial c-lists, the number of which is at most 2°( ddim )m [ KL04b| , 

Theorem 2.1]. It follows that the total space usage for the c-lists is 2°( ddim V 

The nontrivial c-lists also limit the compaction of the tree T as follows. We compact T only 
along paths whose nodes are both non-branching and have trivial c-lists. By the above bound on 
the number of nontrivial c-lists, our limited compaction of T uses at most 2°( ddim )m space. 



Preprocessing time. The preprocessing stage first employs the data structure of [ KL04b | to 
construct the c-lists in 2°( ddim )mlog A log log A time, by simply inserting the data points one after 
the other. We then scan this structure, from top to bottom, to introduce direct pointers as dictated 
by the c-lists (i.e., from L y ^ fi to relevant L Zj j_i jC ), and also construct the tree T in its compacted 
version. This entire process takes 2°( ddim )m log A log log A time. 

Projected Tree. A key tool in getting faster runtime is a projection of the tree T = Tm onto a 
subset of points Q C M. The idea is to consider the subtree of T induced by the leaves that are 
points in Q. We will denote this projected tree T\q. Observe that this projected tree might be very 
different from the tree Tq that would be constructed for Q independently of M; in particular, the 
latter cannot contain points from M\Q. In the interest of runtime, we maintain the projected tree 
somewhat implicitly; what the data structure stores explicitly is a compacted version, in which all 
non-branching paths of T\q are contracted, and this clearly uses only 0(n) space. Notice that such 
a contracted path of T\q might contain nodes that are branching in T, which possibly correspond 
to distinct data points in M. Although we have only the compacted version of T\q at hand, we 



can implement a traversal down the un- compacted tree T\q, as described in Lemma |2.2 , 

To aid in the construction of the projected tree, we number the leaves of T in depth-first search 
(DFS) order. In addition, for every node u € T\q we denote by wt(it) the number of leaves in its 
subtree. We can compute the weight of all the nodes in T\q in time 0(n) by a simple scan. 

Lemma 2.1. When a query Q is given, the compacted version of T\q can be computed in time 
0(n log n). 

Proof (Sketch). To create T\q, first sort Q according to the DFS numbering. Notice that the order 
in which points from Q are encountered when performing a DFS on T is exactly the order in which 
they would be encountered had we performed a DFS on T\q. Hence the sorted Q gives us this 
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order. We now use LCA queries to simulate the DFS on T\q, in order to construct T\q. This is 
done as follows. Denote by Qi C Q the first i — 1 points in the ordered Q. We scan Q by the DFS 
order, and when we reach the i th point, say qi € Q, we assume we have already constructed T\Q i _ 1 
on the first i — 1 points of Q. We now wish to insert to this tree to obtain T\q v To do this, 
compute u = LCA(q , j_i, qi). This node u, which has to be part of T\q i , is either on the path from 
the leaf corresponding to qi to the root of T\Q i _ 1 , or is an ancestor of the root of T|q 4 _ 1 . To locate 
its exact position, we traverse T\Q i _ 1 from the leaf corresponding to upwards towards the root, 
testing at each node v if this is the location into which u should be inserted. The testing at v is 
performed via an LCA query between v and u. If the LCA query returns u, then the traversal 
needs to continue. If not, then u is inserted as a child of v, either breaking an edge or inserting a 
new leaf. The entire process simulates the DFS search on T\q and hence takes 0{n) time. □ 

The next lemma is used to traverse the un-compacted tree T\q while using the data structure 
of its compacted version. 

Lemma 2.2. Given the compacted T\q, a node v in the un-compacted version oJT\q together with 
its weight wt(u), and node w which is the closest descendant of v in the compacted T\q (and could 
possibly be v itself), it is possible to locate the children of v in the un-compacted T\q, together with 
their weights, in time 2°( ddim ). 

Proof. Suppose first that v is a branching node in T\q. For each child u of v in the compacted 
tree T\q, we find the respective child of v in the un-compacted T\q as follows: Run an LCA query 
between u and every child of v in T. All of those queries will return v, except for one query that 
will return the required child of v in the un-compacted tree (the one that is also an ancestor of u) . 
The total time for all such queries is 2°( ddim ). 

Suppose next that v is a non-branching node, and hence is not a part of the compacted version 
of T\q. We find the child of v that is an ancestor of w in the un-compacted tree as follows: Perform 
an LCA query between w and each of v 's children in T. All of those queries will return v, except 
for one query that will not return v, but rather the child of v that is also an ancestor of w, denoted 
by w' . Notice that in this case, w is also the closest descendant of w' in the compacted T\q, which 
is needed to continue our traversal and proceed to !«'. □ 

Standard Operations on the Net Hierarchy. A basic operation in a net hierarchy is a recur- 
sive scan, where given a point yi € Yi, we scan its c- lists and apply the same procedure recursively 
on these points. During this process, we discard any duplicates we find (e.g., if we reach the same 
point in 1^-2 via different points in 

Definition 2.1. Let y 6 Yi- A point x £ M is called a c-list-descendant of y if it can be reached 
from y using a recursive scan of the c-lists. We then also say that y is a c-list-ancestor of x. 

Lemma 2.3. Let y € Yi, and let x EYj be a c-list-descendant of y. Then d(x,y) < c2 J+1 — c2 J+1 . 

Proof The proof is by induction on i — j. The base case is trivial. For the inductive step, for every 
y € Yi, the distance between y and any of the points in its c-list is at most c2 l . For every x € Yj 
which is a c-list-descendant of y, there exists a y^_i € L y i c such that x is a c-list-descendant of 
Therefore, d(x, y) < d(x, y^x) + d(yi- U y) < c2* - c2^' +1 + c2 i = c2 i+1 - c2^' +1 . □ 

Notice that a point x can be a c-list-descendant of y even if in the tree T it is not a descendant 
of y. However, ancestors and descendants in T also have bounds on the distance between them. 
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Lemma 2.4. Let y be an ancestor of x in T, such that y G Y and x G Yj, where i > j. Then 

d(x,y) < 2 i+1 < 2 i+1 . 

Proof. The distance between a parent from Y and its child in T is at most 2\ Therefore, by 
summation on the path from y to x in T, and the triangle inequality d(x, y) < Y2k=j+i ^ = 
2*+i - 2J+ 1 . □ 

The following lemma is crucial to searching the vicinity of a given point with some refinement 
factor e > 0, by executing a recursive scan with limited depth. This process will be used several 
times in our various algorithms. 

Lemma 2.5 (Descendents Search with Refinement e). Let y GYi and x G M be such that d(x, y) < 
2 l , and suppose c > 3. Then for every refinement constant < e < 1/2, a recursive scan of c-lists 
that stops at level i — log(l/e) will traverse a point x' G Yi-iog(i/e) f or which d(x,x') < s2 l . In 
addition, the number of points traversed in such a scan is at most e ~ ( ddim ) . 



wc 



Proof. Let x e be the ancestor of x in T who is in the Yi-\ g(i/e) ne t> an d so by Lemma 2.4 
have that d(x,x e ) < 2* _log ~ +1 . We prove by induction that for every i — log - + 1 < j < i and 
every yj G Yj such that d(x,yj) < 2P + , the recursive scan of c-lists from yj will reach x £ . This 
will suffice as for every x such that d(x,y) < 2*, we also have d(x,y) < 2 i+1 . For the base case, 
j = i — log - + 1, and so d(yj,x £ ) < d(yj,x) + d(x, x e ) < 2 J+1 + 2 J < c2 J , and so x £ is in the c-list 
for yj. 

For the induction step, assume that the claim is correct for j — 1. Consider Xj-i G Y-i which is 
the ancestor of x e in T, and therefore is also an ancestor of x. Then by Lemma ^4|, d(x, Xj— i) < 2 J 
and by the induction hypothesis, a recursive scan on the c-lists starting from Xj-\ will reach x e . 
Then d(xj-\,yj) < d(xj-±,x) + d(x,yj) < 2 J + 2 J+1 < c2 J , and so a recursive scan on the c-lists 
starting from yj must go through Xj-\ and eventually reach x e . 

The number of points traversed can be bounded as follows. Each point x not in 5^_i g(i/ E ) that 
is encountered needs to scan its c-list which is of size 2°( ddim ). So at k levels beneath y we scan 
at most 2°( ddim ' fc ) points. The last level scanned is when k = log ^, so using a geometric series we 
obtain that the total number of points scanned is 2°( ddim l °s( 1 /e)) = e -0(ddim)_ n 



3 A Simple Algorithm for 1-median 

In this section we provide a simple iterative algorithm for 1-median, for the purpose of explaining 
the basic approach used by our main result for 1-median in Section |4j. This basic approach is similar, 
but not identical, to the known algorithms for NNS jKL04b| , |HM06| , |BKL06| , pG06fl . We remark 



that there is a well-known randomized algorithm that achieves an (expected) 2-approximation for 
1-median by picking a random point from Q to be the center. Below, we present a deterministic 
6-approximation algorithm, which has the advantage that it is then easily refined to achieve (1+g)— 
approximation. Unlike that randomized algorithm, ours can probably be adapted to the case where 
Q need not be a subset of M (or alternatively, when the center must come from M\Q). 

Theorem 3.1. There is an algorithm that preprocesses a finite metric M in 2°( ddim )m log A(M) log log A(M) 
time using 2°( ddim )?7i space, so that subsequent 1-median queries on a set Q C M , can be answered 
within (1 + e) -approximation (for any desired < e < \) in time n(2°( ddim ) log A + e -°( ddim )). 
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The preprocessing algorithm simply builds the net hierarchy for the metric M (see Section |2|) . 
The query algorithm is described in Figure [l[ For convenience, we use the shorthand med(y, Q) 
for med({y},Q). By convention, for all i < we define Yi := M (similarly to Yq), and note that 
the corresponding c-lists can be computed on the fly by a direct filtering of the respective c-list at 
level 0. 



1. lety<- y top 

2. for each i from it op down to —3 

3. let y <- axgmm zeLy . 7 med(Q, z) 

4. if med(Q,y) > 3n • 2 l_1 then return y. 

5. else y 4— y. 

6. return y. 

Figure 1: Simple algorithm for 1-median query on a set Q C M 



Correctness Analysis. Assume for now that the algorithm returns from line 4. (We discuss 
later the more special case where the algorithm reaches line 6.) For the following, let i* be the final 
value of i (i.e., at line 4), and let y and y refer to their values in the algorithm at the end of the 
execution. It can be verified that the condition in line 4 must fail at least once, by considering yt op 
as a potential center y, and bounding the distance between every point in Q C M to ytop using 
Lemma |2.4|. Therefore, 



Z qeQ d(q,y)>Sn-2 i '~\ and £ 9eQ y) < 3n • 2** . (1) 

Let a € M be an optimal solution to the 1-median problem on Q. Let <2j*_i 6 Yi*—i be an ancestor 
of a in T. Then d(aj*_i, a) < 2* by Lemma 2.4 . 

Lemma 3.2. d(aj*_i, y) <7-2 l , and thus aj*_i 6 L Vj i*j. 

Proof. Using the triangle inequality, the optimality of a G M, and then ([!]), 

n ■ d{ai*-i,y) < ^2 [d(ai*-i,a) + d(a,q) + d(q,y)] < ^ d(ai*-i,a) + 2'^d(q,y) < 7 ■ n2 l * . □ 
qeQ q eQ qeQ 

Lemma 3.3. X^eQ d(q, a) > n ■ 2 t *~ 1 . 



Proof. By Lemma |3.2| , when the algorithm computes y in the final iteration, one of the options it 
considers is Oi*_i, and so 

^2 d(q, y) <^2 d< y^ a i*~±) ^ ^ d ( q > a ^ + d ^ a ' a **- 1 )] - a ) + n • 2 ** • 

geQ qeQ qeQ qeQ 

Combining this with (Q) and rearranging, the lemma follows. □ 



Thus, if we returned from line 4, then using ([[]), the approximation factor achieved is 



^'^li = 6. If we returned from line 6, Lemma |3.2| holds also for i = —3, and thus at the last 
execution of line 3, we have d(a,y) < But since there cannot be two points with distance less 
than < 1, we see that y = a, and the returned point is an optimal solution a. We remark that 
a similar effect can be achieved by stopping at i = 0, possibly increasing the value of c. 



< 



S 



3.1 Refinement to (1 + e)— approximation 

We now improve the approximation factor to 1 + e for an arbitrary e € (0, |]- We can utilize the 
fact that a is a descendant of aj*_i in T, so d(a, aj*-i) < 2* , and that aj*_i € L Vi i* >c . As such, we 
perform a descendant search with refinement constant e/2, starting from each member of L V i* c . 
By Lemma ^1], we are guaranteed to traverse a point ct| such that d(a, a|) < §2**. For each point 
x traversed in this process, we compute med(Q, x), and eventually report a center candidate x with 
minimal objective value med(Q,x). Using (|l|) again, this objective value is 

med(Q, x) < med(Q, ci|) < ^d(q, a) + a|) < med(Q, a) + |n2 l < (1 + e) med(Q, a). 

3.2 Runtime Analysis 

The running time of the first part of the algorithm is 2 c '( ddim )n log A, as there are at most 0(log A) 
levels, and at each level we compute the distance from every point in Q to every point z € £y,i,7- 
In the second part of the algorithm (the descendants search) we compute the cost of each of the 
£ -0(ddim) center candidates in 0(n) time. The total runtime is n(2°( ddim ) log A + e -C(ddim)^ and 
the space usage is just that of the hierarchy, which is 2°( ddim W 



4 An Efficient Algorithm for 1-median 

Theorem 4.1. There is an algorithm that preprocesses a finite metric M of size m in time 
2°( ddim )mlog A(M) log log A(M) using 2°( ddim )m memory words, so that subsequent 1-median 
queries on a set Q C M of size n can be answered within approximation factor 1 + e (for any 
desired < e < 1/2) in time 0(nlogn) + 2°( ddim ) log A(M) + e ~0( ddim ) . 

This theorem builds on the simple algorithm from Section ||, refining the approach therein using 
two main ideas. First, as we iterate down the levels i, some query points q £ Q might get further 
away from the current center i/j G Fj. But then, picking any c- list-descendant of as the final 
center will give approximately the same contribution from those far query points. This speeds up 
the traversal down the hierarchy as query points need not be considered once they get far enough 
from yi. The second idea is to cluster query points that are close to each other, relative to the 
current level i, into one (weighted) representative point. This (crude) clustering must be computed 
quickly, and indeed it is achieved using the projection tree T\q. Once we bound the number of 
weighted representatives under consideration in each iteration, we obtain a significant speedup. 



Algorithm Description. We first describe a constant factor approximation algorithm, which is 
detailed in Figure || using a,c' > to denote sufficiently large constants. Similarly to the simple 
algorithm in Section ||, the algorithm iterates (in lines 3-12) down the levels i, while maintaining a 
candidate center yi € Y^. However, the iterations here start at the root of T\q (instead of at 2/top)- 
Observe that the next candidate yi-i is always chosen from the c-list of yi (lines 9,12). 

During the iterations, the algorithm maintains also a set Ri of representatives to some points of 
Q, those points that are not too far, as explained next. The level i representative of a point q € Q, 
denoted ri(q), is the (unique) ancestor r G Yi of q in T\q. Notice that this is the same ancestor as in 
the tree T. The algorithm also uses, for each representative r G B4, a weight denoted wt(r), which 
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1 . compute T\n 

2 . let i mot <- level(root(T| Q )); y iroot _i <- root(Tj Q ); R iioot <- {root(Tj Q )}; sum <- 

3 . foreach i from i root — 1 down to 

4 . let Ri <- 

5 . foreach r € R4+1 

6. if d(r, yi) > d ■ 2 i 

7 . then let sum «— sum + wt(r) • <i(r, yi) 

8. else let i?j <— Ri U {children of r in non-compacted T|q}. 

9. let y <- argmin^^ . JEreiii d ( r ' x ) ' wt ( r )i 

10. if sum + ^rei?i ^( r ' v) ' wt ( r ) > a " n " 

11. then return yi 

12. else <- y 

13. return y_i 

Figure 2: Efficient Algorithm for 1-median query on a set Q C M 



is the number of points in Q that have r as an ancestor in T\q. This weight is calculated for each 
node in T\q during the tree's construction in line 1. The set of representatives Ri is constructed 
(in lines 4,8) from children of in T\q, which clearly maintains the invariant Ri C Yi. In this 
process, we skip (via the condition in line 6) representatives r € Ri+i that are far enough from yi, 
in which case we add their weighted distance wt(r) • d(r,yi) to a variable called sum. The purpose 
of this variable is to accumulate all those weighted distances, but note that each weighted distance 
is taken relative to yi at the iteration in which the representative r fails the condition in line 6. 
Denote by surrij the value of variable sum at the end of iteration i. For representatives r G Ri+i 
that are close enough to y^ we need to compute their children in the un-compacted T\q (in line 
8). For simplicity sake, the algorithm's description assumes that the tree T\q is available in its 
un-compacted version. The necessary operations can be implemented using the data structure for 
the compacted version by Lemma [2.2j . 

4.1 Correctness Analysis 

We say a point q S Q is far at level i if it has no representative in Ri, which means that during 
some iteration i' > % its representative was skipped. A point q € Q is near if it is not far. Let Fi 
denote the points of Q that are far at level i, and similarly Ni = Q\Fi for the points that are near. 
Notice that Fj 5 F i+1 and N { C N i+1 . 

Let i* be the value of i at the end of the execution. This is the "last" level (time- wise) considered 
by the algorithm, and the analysis shall rely on the corresponding partition Q = Ni* U Fi* . For 
q G Q, we denote its representative in Ri by Ti(q). We let r q be the "last" representative of q, 
formally defined as follows. If q € Fi* , define i q as the smallest i such that q € Ni. Intuitively, 
this is the "last level" in which q has a representative, and also the (unique) value of i such that 
q S Ni \ N-i = Fj_i \ Fi (assuming by convention iVj*_i = and = Q). Otherwise (i.e., 

q € N*), define i q := i* . In both cases, let r q := r{ (q). Notice that r q £ lj . 

At iteration i, the variable called sum receives (in line 7) a contribution for every point q € 
Fi\Fi + \. Observe that this contribution is proportional to d{ri + \{q),yi), and the last representative 
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of q is at level i q = % + 1. Hence, r q = ri (q) = rj + i(g), and by the condition in line 6, 

d{r q ,y iq -i) = d(r i+1 (q), yi ) > c'T = c'2^ 1 . (2) 
Summing the aforementioned contributions over all iterations up to i, we see that 

sum; = ^ ^ d(r i/+ i(q),y if ) = ^ d{r q ,y iq _i). (3) 

V>i qeFi,\F il+1 q&F, 



In addition, r q £ Y{ and is an ancestor of q in T.Thus, by Lemma 2.4, 

Vg€Q, d(q,r q ) <2 { " +1 (4) 
VqeNi*, d{q,r q ) < 2** +1 . (5) 

Below, y refers to its value at the end of the execution. 

We assume from now on that the algorithm halts during some iteration and returns the value 
from line 11. Similarly to Section ||, the special case where the algorithm returns from line 13 is 
proved by replacing Eqn. (|6|) and its consequences with the fact that we reached iteration i = 0. 
Thus, at the last iteration, i = i*, the algorithm halts, and 

sunv + ^2 d(r, y) ■ wt(r) = sunv + ^ d(r q , y) > a ■ n ■ 2 l *~ 1 . (6) 

Similarly, at the previous to last iteration i = i* + 1 and y is assigned y%* , hence 

sunv + i + ^2 d(r,yi*) • wt(i) = sunv + i + ^ d(r^ +1 (q), y;*) < a ■ n ■ 2 l * . (7) 

r£Ri*+i q€Ni* +1 

This inequality holds even in the special case where i* = i roo t — 1 and there was no previous to last 
iteration. Indeed, we have that Fi* + \ = 0, sunv + i = 0, = {root(T|g)} and yi* = root(T|g), 

and therefore, sunv + i + X^rgi? * +1 d{r, yi*) = < a ■ n ■ 2 l . 

Lemma 4.2. sunv + X^eTV,* d ( r i> Vi* ) < {a + 2) ■ n ■ 2 % * . 
Proof. We write the lefthand-side as 

sunv + ^2 d{r q ,yi*) = 

= sumi* + i+ J2 d(ri* +1 (q),yi*) + ^ d(n* (q) , y { *) 

< sunv+i + J2 d(ri* +1 (q),yi*)+ ^ [d(ri*(q),ri*+i(q)) + d(n* + x(q),yi*)] 

qeNi* +1 -N it q£N it 

< sunv+i + ^2 d(n* + i(q),yi*) + n2 e+1 , 

qeNi« + l 

where the last inequality follows from v(g) being a child of ?> + i(g) in T\q. The lemma then 
follows by plugging in Eqn. (|). □ 
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For every q E we have by Eqn. (|2|) that d(q,r q ) < 2 lq+l < Ad(r q ,yi i). In addition, 



,Vi q -l), because is a c-list-descendant of yi q -\ and thus Lemma || 
applies. To simplify notation, define (3 := -7 + -7 and notice it can be made an arbitrarily small 
positive constant by controlling d. For example, it is always possible to make (3 = |. We can now 
show that with respect to the query points Fi*, our estimate surrij* is a good approximation for the 
cost of picking t/i* as the center. 



^2 d(q,yi*) < ^2 [ d (Q, r q) + d{r q ,y iq -i) + d{y ig -i,yi 



< (! + ? + F ) E = ( X + ^) SUm » 



(8) 



In addition, we show that with respect to the query points iVj*, the representatives give a good 
approximation as well. 



d(q,Vi*)< E [%,^) + d(r g ,y;*)] <n2 l ' +l + d(r q , yi *). 



(9) 



Let a € M be an optimal solution to the 1-median problem Q, and let aj*_i £ be an 

ancestor of a in T. Thus, d(eij*_i,a) < 2* . We next prove that yi* is near aj*_i, and thus also 
near a itself. 

Lemma 4.3. d(aj*_i, j/j*) < c2* and therefore aj*_i £ Lj/-*,i*,c- 
Proof. We start with the lefthand-side multiplied by n 

n • d(ai»_i,yi») = d(oj»_i,2/i») 



9GQ 

< ^[d(a,*_i,a) +d(a,g) +c%?/ i *)] 



by triangle inequality 
by optimality of a 



96Q 



n2 



* +2[ <%!/»•)+ E 



by Eqns. ®,(|) 



by Lemma 4.2 



<n2 i *+2[(l + /3)sum i *+n2 i * +1 + ^ d(r q , yi *)] 

< 5n2 r + 2(1 + p)(a + 2) • n ■ 2 l * 

< crd* . 

For the last inequality we need to pick a large enough constant c > 0. (Recall that f3 can be made 
to be 2 increasing d as needed.) Dividing all by n completes the proof. □ 

We now want to prove the guarantee of our approximation. To this end we need an upper bound 
on the cost of the algorithm's solution, which we establish by analyzing the stopping condition 
iteration. 



Lemma 4.4. £, eQ d(g, y** ) < [2 + (1 + /3)(a + 2)]n2*\ 
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Proof. First, each q is close to r q and thus 



Thus, 



q&N^ qeNi* qeN t * 



Y d ^ yi *^ = Y d (i'yi*)+ Y d ^yi*) 

q&Q q£N t * qeFi* 

< n2' r+1 + Y d ( r n Vi* ) + (1 + /3)sumi» 
<n2 i *+ 1 + (l+/3)(sum i * + Y d(r q , yi .)) 

qeNi* 

< [2 + (1 + /3)(q + 2)] n2 4 *. 



Lemma 4.5. surrij* < S<je-F-» d(q,a) 
Proof. First notice that 



d(a,y iq -i) < d(a,ai*-\) + d(a i *- 1 ,y iq - 1 ) 

< 2** + c2*« - c2^ 

< c2*« 



where the bound on d(dj*_i,?/j i) follows from Lemma 2.3. Therefore, 



^ d(g,a) > ^ [d(r„2/ i(I _i) - d(q,r q ) - d(a,y i(( _i)] > (1 - /3)sunv. 



□ 



□ 



We are now ready to provide a lower bound on the optimal solution. Recall that y refers to its 
value at the end of the algorithm. 

Lemma 4.6. ^2 q ^Q d(q, a) > (a/2 — 3)(1 — /3)n2 l (assuming the algorithm returns from line 11). 
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Proof. 



a-n-2 e 1 <sum J »+ ^ d(r q ,y) © 

< sunv + d(r q ,ai*-\) by Lemma 13 and choice of y 

< sunv + ^2 i d ( r q, <?) + d(q, a) + d(a, aj*_i)] 

< sunv +n2 i,+1 + ^ d(q,a) + n2 i * 

qeNi* 

< 3n2 l + cZ(q r , a) + cZ(q r , a) by Lemma -L5 
<3n2 i * + r ^J^d(g J a). 

□ 

We conclude that the algorithm achieves approximation factor 

E q e Q d{q,yi*) [2 + (l + /3)(q + 2)]n2'* [2 + (1 + /3)(q + 2)] 
E geQ d(q,a) ~ (l-/3)(a/2-3)n2** ~ (1 - /3)(a/2 - 3) 

4.2 Refinement to (1 + e)— approximation 

Our goal now is to improve the approximation factor to 1 + e for arbitrary e > 0. We can utilize the 
fact that a is a descendant of a,*_i in T, so d(a, aj*_i) < 2* , and that aj*_i € Ly it .i*, c - As such, 
we can perform a descendant search, as in Lemma |2.5| , starting from each member of L yit ^* yC , with 
refinement constant e' = G(e). By Lemma we are guaranteed to traverse a point a £ i such that 
<i(a, a £ i) < e'2* . However, we wish to avoid the high runtime of computing the cost of each center 
candidate by summing the distances from all of Q to that point. Instead, we once again speed up 
the process by removing far points, and using weighted representatives for the rest. 

Speeding up the descendants search. Define the set of far points F = {q £ Q : d(q,yi*) > 
^p-2* } for some e' > to be determined later. The set of near points is N := Q \ F. The 
points in F are ignored in this phase of the algorithm. For the points in iV we wish to find good 
representatives so that the number of representatives is few, and the additive distortion caused 
by replacing the query points in iV with their representative is very small. To this end, consider 
the set of representatives obtained as follows. Each point q £ N is mapped to its ancestor in the 
compacted T\q which is in for the largest k < i* — log(l/e") for e" > to be determined later. 
Call this set of representatives R s ", and give each r G R £ n a weight wt(r) which is the number of 
points in N that were mapped to r. Notice that the process of this mapping and weighting can 
be done efficiently by scanning the compacted T\q in 0(n) time. Now, for each center candidate 
x obtained by a descendants search from each of the points in L y .„^*^ c by using Lemma |2.5| with 
refinement constant e", we compute Xlre-R » d(r,x) -wt(r), and take the candidate which minimizes 
this cost. 
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We want to argue that the candidate returned is a 1 + e approximation from the optimum. 
Denote this candidate by x. Notice that one of the candidates must be a point a e » which is an 
ancestor of a in T and is in Yk for some k < i* — log(l/e"). Therefore, ^rei? » d (r,x) ■ wt(r) < 
T,reR e „ d ( r > a e") ' wt(r). 

Lemma 4.7. "£ q eQ x ) < i 1 + e ) EggQ d (Q, a ) ■ 

Proof. Denote by Xj*_i the c-list-ancestor of x in Ly., i* c C i- First, for every q £ F, 
d(q,x) < d(q,yi*) + d(y i * 1 x i *-{) + d(xi*-i,x) 



< d(q,yi*) + c2* 1 + c2* by Xj*_i G L y .* t i* tC and Lemma 2J3 

< d(q,yi*) + e'd(q,yi*) since g G F 
= (l + ^d^yi.), 



and similarly, 



d(q,yi*) < d(q,a) + d(a,aj»_i) + d((H*-i,yi*) 



< d(q, a) + 2 % + c2 l by Lemma 4.3 



< d(q, a) + e'd(q, y,*). since q £ F 

Therefore, d(q,yi*) < . Combining this with our earlier inequality, we get 

^%,x)<[±^^%,a). (10) 

For the near points, we have 

£ x) < ^ a £ ») < ^ [d(g, a) + d(a, a e „)] < £ [<% a) + e"Z i *+ 1 ] . (11) 

qgTV ggN q£N q£N 

Altogether, the cost of the reported center candidate x is 

£ x) < ^ d( 9 , a) + ^ «) + e"2** +1 ] by Eqns. ©,© 

<JSQ q£F q£N 

* 137 £ «) + (a/2 - 2 3)(l-/3) H d(? ' a) by Lemma S 



r^ + (a/2- 2 3)(l-/3))^^' a) - 



96Q 



Setting e" := ( a / 2 ^il an d e ' := we ge t that Yl q eQ d {<l, x ) - ( 1 + e ) EgeQ a )- '-' 
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4.3 Runtime Analysis 



In the first part, the compacted version of T\q is constructed (in line 1) in time 0(n log n) using 
Lemma ^j]. At each iteration i we locate y (in line 9), which becomes j/j* . The runtime of this step 
is proportional to the number of candidates in the c-list multiplied by the size Ri. The number of 
candidates is |L„ }i)C | < 2°( ddim ). The size of Ri is at most the number of points in which are at 
most c'2 i away from Vi . We conclude that \Ri\ < 2 ddiml °g( c ' 2l+1 /2 l ) < 2 0(ddim)_ 

Computing Ri from takes 2°( ddim ) time per member of R%+i, for a total of 2°( ddim ) 

per iteration i. Thus, the total time spent on finding yi* is 0(nlogn + 2°( ddim ) log A). For the 
descendants search used in the refinement to (1 + e)-approximation, we can bound the number of 
representatives as follows. 

Lemma 4.8. \R E „\ < £ ~0(ddim) _ 

Proof. If all of the representatives are in 5^*-i g(i/e") then the size of R e " is at most the number 
of points in ^*_i g(i/ e ") which are at most ^-2** away from y.;*. The number of such points is 
bounded above by 

2 0(ddimlog(^ r 2 i */2 i *- 1 ° g(1/e ")) _ 20(ddimlog 5 ^£ 7r ) < £ -0(ddim) 

However, the representatives do not all have to be in Yj»_ iog(i/ e ")- To overcome this, we charge 
each representative in R £ n to a different point in ^*_i g(i/ e ")- This mapping is done by assigning 
to each point in R £ n its ancestor in the un-compacted T\q which is in i^*-i g(i/e")- Notice that 
no two points in R £ n can be assigned to the same point in i^*-iog(i/e")j as otherwise there would 
be another node in the compacted T\q which is an ancestor of those two points and in for 
k < i* — log(l/e"), which contradicts the method in which the representatives were picked. □ 

The total cost of the descendants search is the number of representatives multiplied by the 



number of center candidates. From Lemma 2.5 we know that the number of candidates is at most 
£ n— o(ddim) < £ -0(ddim)^ anc j therefore, the runtime of this refinement stage is bounded by e -0(ddmi)_ 
Overall, the runtime of computing a (1 + e) -approximation for the 1-median is 0{n log n) + 



2 0(ddim) log ^(M) + e-OCddim)^ and thig comp i etes t h e proof of Theorem O 



5 Algorithm for 1-Center 

It is helpful to see the solution for the 1-center problem prior to seeing the solution for the p-center 
problem, as many of the ideas are similar, and the exposition with only one center is simpler. 
Therefore, we first prove the following. 

Theorem 5.1. There is an algorithm that preprocesses a finite metric M in time 2°( ddim )mlog Aloglog A(M) 
using 2°( ddim )m memory words, where m = \M\, ddim = ddim(M) and A = A(M), so that subse- 
quent 1-center queries on a set Q C M , can be answered with approximation factor 1 + e, for any 
desired < e < 1/2, in time 0(n log n + log log log A + e -0( ddim )) ; where n = \Q\. 

The preprocessing algorithm simply builds the net hierarchy for the metric M, and prepares it 
for weighted level ancestor queries (see Section [|). For the query, we first recall a trivial algorithm 
that provides a 2-approximation for the 1-center problem on a query set Q C M, and then refine 
it to provide a (1 + ^-approximation. 
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Let a G M be an optimal center, and denote its value by OPT := max^gQ d(a, q). Notice that 
every point y G Q gives a 2-approximation, because its objective value is 



ALG := max d(y, q) < maxMy, a) + d(a, y)} < 2 ■ OPT. 
<?e<2 <?eQ 

We thus pick any point y G Q as our first approximation, and proceed to the refinement stage. 

Refinement to (1 + e)— approximation. Let i be an integer such that 2 l ~ l < ALGo < 2% and 
notice that OPT < ALGq < 2 l . We begin by locating the ancestor € Yj of y in T. This can be 
done using a weighted level ancestor query |FM96 , KL07]. We next show that a is fairly close to 
Vi- 

Lemma 5.2. Let a^-\ G l^-i 6e an ancestor of a in T. Then aj_i G L yi i %. 
Proof. For every point g G Q, 

Ifc) < d(oi_i, a) + d(a, q) + d(?, y) + yi ) < 2* + OPT + 2 • OPT + 2 4+1 < 6 • 2\ 

□ 

This lemma implies that the optimal center a is a descendant in T of some point in L y . ^. 
Executing a descendants search from all the points in L yu i$ by using Lemma 2^ with refinement 



constant e will guarantee that we traverse a point a e such that d(a, a £ ) < e2*. Denote the set of the 
points seen in such a descendants search by D. Unfortunately, this process computes (separately) 
the cost of each candidate traversed by taking the maximum distances from all of Q to that 
candidate, which would take time e~°( ddim ^n. We can speed up this process by using (a few) 
representatives of Q, as is explained next. 

Speeding up the descendants search. We wish to find a bounded-size set of representatives 
for the points in Q, such that the distortion caused by considering them (instead of Q) is small. To 
this end, consider the set of representatives obtained as follows. Each point q G Q is mapped to its 
ancestor in the compacted T\q which is in for the largest k < i — log(l/e'), for some refinement 
constant e' = 0(e) to be determined later. Call this set of representatives R £ i. Notice that R £ i 
is a subset of the compacted T\q and thus the process of this mapping can be done efficiently by 
scanning the compacted T\q in linear time. Now, for each center candidate x G D we compute 
max r gij E , d(r,x), and return the candidate x that minimizes this cost. 

The next lemma shows that this algorithm achieves (1 + e)~approximation. 

Lemma 5.3. cntr(Q, {x}) = max gg Q d(x, q) < (1 + e)OPT. 



Proof. Every q G Q has a representative in R £ >, for which we can apply Lemma 2.4 and the triangle 
inequality, and thus 

max d(x,q) < maxd(x,r) +e'2 i+1 . 

q&Q r£R e , 

Recall that one of the center candidates is some a £ > G ^-i og (i/e') that is an ancestor of a in T. 
Therefore, the returned center x satisfies 

m&xd(x,r) < max d(a E r, r). 
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Let r* £ R e * be a maximizer for the righthand side, and let q* £ Q be such that r* is a representative 
of q*. Using the triangle inequality and Lemma [TJ again, 

d{a £ ,,r*) < d{a, a e >) + d{a, q*) + d{q* ,r*) < OPT + 2 • e'2 i+l . 

Recalling from earlier that 2* < 2ALGo < 40PT, we finally combine the inequalities above and 
conclude that max geQ d(x, q) < OPT + 3 • e2 i+1 < (1 + 24e)OPT. To complete the proof, set e' to 
be a power of 2 in the range [^, □ 

Runtime. The running time of the above query process is as follows. Locating j/j using a weighted 
level ancestor query takes 0(log log log A) time as there are only log A possible nets. After con- 
structing T\q in 0(n log n) time, the mapping of every q £ Q to its representative takes, alto- 
gether, 0(n) time. The descendants search for each of the 0(2 ddim ) points in L yu i$ takes time 
£ /-0(ddim) <- £ — O(ddim) time, w mch also bounds the number of candidates. The number of repre- 
sentatives for Q is bounded by the following lemma. 

Lemma 5.4. \R £ ,\ < £ -0(ddim)_ 

Proof. If all of the representatives are in 5^_ log then the size of R £ ' is at most the number of 
points in l^_i og which are at most 2' away from g/j. The number of such points is bounded above 
by 

2ddim-log(272 I - 1 °g( 1 /e:')) _ q , /\0(ddim) _ £ -0(ddim) 

However, the representatives do not all have to be in i^*-i g(i/e')- ^° overcome this, we charge each 
representative in R £ i to a different point in 3^*_i g(i/e')- This mapping is done by assigning to each 
point in R £ i its ancestor in the un-compacted T\q which is in ^*-i g(i/e')- Notice that no two points 
in R £ i can be assigned to the same point in ^*-i g(i/ E ')j as otherwise there would be another node 
in the compacted T\q which is an ancestor of those two points and in Y& for k < i* — log(l/e'), 
which contradicts the method in which the representatives were picked. □ 

It follows that the time it takes to evaluate the cost of all center candidates is (altogether) 
£ -0(ddim\ and thus the algorithm's total runtime of is 0(ralogn + log log log A + e - °( ddin1 )). 



6 Algorithm for p-center 

Theorem 6.1. There is an algorithm that preprocesses a finite metric M in time 2°( ddim )mlog Aloglo^ 
using 2°( ddim ^m memory words, where m = \M\, ddim = ddim(M) and A = A(M), so that subse- 
quent p-center queries on a set Q C M , can be answered with approximation factor 1 + e, for any 
desired < e < 1/2, in time 0(nlogn + p log log log A _)_ pP+ 1 £ -°(p- ddim )) ; where n= \Q\. 

The preprocessing algorithm simply builds the net hierarchy for the metric M, and prepares 
it for weighted level ancestor queries (see Section |2|). For the query, we first use the algorithm 
of Gonzalez from |Gon85 [ on Q, which obtains a 2-approximation for the p-center in 0(p ■ n) 



time. In other words, the algorithm locates a set B C Q of size p such that if we denote its 
objective value as ALGo := max gG Q d(q, B), and if A C M is an optimal p-center set with value 
OPT := max 9eQ d(q, A), then ALG < 2 • OPT. 
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6.1 Refinement to (1 + e)— approximation. 

Let i be an integer such that 2*" 1 < ALGq < 2\ For each b G B locate the ancestor bi G Yi of b in 



T. This can be done using a weighted level ancestor query [FM96, KL07]. For a center a G OPT, 
let ai-i G Y{_i be an ancestor of a in T. 

Lemma 6.2. For every a G OPT there exists a point b £ B such that Oj_i G L^^t^. 

For every point q £ Q which is assigned to a in OPT, let b be the center of the cluster of q in 
B. Then d(a,i-i,bi) < d( ai ,a) + d(a,q) + d(q,b) + d(b,bi) < 2 i + OPT + 2 • OPT + 2 i+1 < 6 • 2\ □ 

This implies that a center a G A is a descendant in T of some point in IJfeeB ^b^ifi- Performing 
a descendants search from each of the points in by using Lemma [2.5| for some refinement 

constant e' = 9(e) to be determined later, will guarantee that for each a G OPT we traverse a point 
a such that d(a,d) < e'2 l . Denote the union of the points seen in such a descendants search by D. 
Unfortunately, this process computes (separately) the cost of each subset of size p of candidates 
traversed by taking the maximum distances from all of Q to that subset, which would take time 
£ -0(ddim) n ^ -yy e can S p eec } U p ^^jg process by using (a few) representatives of Q. 

Speeding up the descendants search. We wish to find a bounded-size set of representatives 
for the points in Q, such that the distortion caused by considering them (instead of Q) is small. To 
this end, consider the set of representatives obtained as follows. Each point q G Q is mapped to its 
ancestor in the compacted T\q which is in for the largest k < i — log(l/e'), for some refinement 
constant e' = 0(e) to be determined later. Call this set of representatives R £ i. Notice that R £ i 
is a subset of the compacted T\q and thus the process of this mapping can be done efficiently by 
scanning the compacted T\q in linear time. Now, for each set of p center candidates X C D we 
compute max rg fl , d(r, X), and take the set of candidates X that minimizes this cost. 
The next lemma shows that this algorithm achieves (1 + ^-approximation. 

Lemma 6.3. cntr(Q, {X}) = max, gQ d(X , q) < (1 + e)OPT. 



Proof. Every q G Q has a representative in R £ i , for which we can apply Lemma 2A and the triangle 
inequality, and thus 

m&xd(X,q) < max d(X, r) + e'2 i+l . 
q eQ r£R F , 

Recall that one of the sets of center candidates is some A £ C ^i-i g(i/ e ') that is the set of ancestors 
of every a G A in T, where A is an optimal solution. Therefore, the returned center set X satisfies 

max d(X,r) < max d(A £ >, r). 

Let r* G R £ ' be a maximizer for the righthand side, and let q* G Q be such that r* is a representative 
of q* . Let a G A be the center in A which is closest to q*, and let a £ i be the ancestor of a in A E >. 



Using the triangle inequality and Lemma 2.4 again, 

d(A £ ,, r*) < d(a £ ,,a) + d(a, q) + d(q, r*) < OPT + 2 • e'2 i+1 . 

Recalling from earlier that 2 l < 2ALGo < 40PT, we finally combine the inequalities above and 
conclude that max ge Q d(X, q) < OPT + 3 • s2 i+1 < (1 + 24e)OPT. To complete the proof, set e' to 
be a power of 2 in the range [^, □ 
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6.2 Runtime 



The running time of the above process is as follows. Locating frj for all b € B using a weighted level 
ancestor queries takes 0{p log log log A) as there are only log A possible nets. After constructing T\q 
in 0(n log n) time, the mapping of each q € Q to its representative takes another O(n) time. The 
descendants search from all of the 0(2 ddim ) points in L h . >ifi takes 0{pe'-°( dAiia ">) = 0(pe"°( ddim )), 
which also bounds the number of candidates. The number of representatives can be bounded by 
the following lemma. 

Lemma 6.4. \R E ,\ < pe -0(ddim)_ 

Proof. If all of the representatives are in i^-iog^/e') then the size of R £ > is at most the number of 
points in ^_i g(i/ e ') which are at most 2* away from each of the p points in B. The number of such 
points is bounded above by 

p£ t-0(ddim) < ^ e -0(ddim)_ 

However, the representatives do not all have to be in ^*_i g(i/ e ')- To overcome this, we charge each 
representative in R £ > to a different point in 3^*_i og (i/ £ ')- This mapping is done by assigning to each 
point in R £ i its ancestor in the un-compacted T\q which is in ^*-i g(i/e')- Notice that no two points 
in R £ i can be assigned to the same point in ^*_i g(i/ e ')> as otherwise there would be another node 
in the compacted T\q which is an ancestor of those two points and in for k < i* — log(l/e'), 
which contradicts the method in which the representatives were picked. □ 

Thus, the time it takes to test each of the ( pe ) candidates is at most 0(pe-°( ddim )), and 

the total runtime of the algorithm is 0(n log n + p log log log A + pP+ig-p-ddim^ Notice that the 
runtime of the algorithm of Gonzalez is 0(np), which is always bounded from above by 0(n log n + 
p p ~ 1 ), and can thus be absorbed by the other terms. 



7 Algorithm for j>-median 

Theorem 7.1. There is an algorithm that preprocesses a finite metric M in time 2°( ddim )m. log A log lo£ 
using 2 c, ( ddim )m memory words, where m = \M\, ddim = ddim(M), and A = A(M), so that subse- 
quent p-median queries on a set Q C M of size n, can be answered within approximation factor 1+e, 
for any desired e G (0, in time 0(n log n) + e -C( ddim ) ( p . l og n)°W • log log log A + e -0(p ddim) ^ . 
logn)°W. 



To a large extent, we follow an algorithm of Har-Peled and Mazumdar [HM04] for approximating 
p-median clustering in Euclidean space. Their algorithm runs in time roughly 0(n + exp(e _d )(p • 
logn) ^ 1 )), where d is the dimension in Euclidean space (in a scenario without preprocessing). In 
order to give a flavor of our preprocessing model, we focus on the case of small p and employ 
an abridged version of their algorithm, with runtime that grows exponentially with p. We note 
that following their techniques more closely may possibly reduce the runtime, like eliminating the 
exponential dependence on p. 



Proof (Sketch). At a high level, the algorithm of Har-Peled and Mazumdar [HM04| works as follows. 
First, construct a set A of p := p ■ log ^ n centers that provides a constant factor approximation 
of the p-median (formally, it is a bicriteria approximation, since p > p). Next, construct a core-set 
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S by building an exponential grid (as defined below) around each of the centers in A, and mapping 
each point in Q to its (approximate) closest grid point, using near neighbor search. The size of the 
core-set is roughly \S\ < e _d |^4|logn (but of course these points have weights that add up to n). 
This means that every solution to the p-median problem on S is a good approximation for the p- 
median problem on Q. Finally, construct another set of exponential grids around each of the points 
in S to obtain a centroid set D, i.e., set of potential centers in the ambient (Euclidean) space, of size 
roughly \D\ < e~ d \S\°^ < e~ 2d (p ■ logn) ^ 1 ). Finally, use a variant of the dynamic programming 
algorithm of Kolliopoulos and Rao [KROTf to quickly compute a near-optimal p-median of S among 
the potential centers D. 



This algorithm of Har-Peled and Mazumdar HM04] carries over to our scenario (possibly using 



some different black-box data structures, for example to solve nearest neighbor search), except for 
the following two main ingredients. The first is the construction of the exponential grid, which 
strongly relies on being in Euclidean space, and the ability to define points in ambient space, which 
we do not enjoy in doubling dimension metrics. The second is the dynamic programming solution 
of Kolliopoulos and Rao [KR07|, which also exploits the Euclidean space structure. We solve the 



exponential grid using T, as shown below, and skip the use of dynamic programming by performing 
a brute- force search over all size p subsets (of the centroid set). It is plausible that our runtime can 
be improved by adapting the solution of Kolliopoulos and Rao |KR07 | to work in our case as well, 
and we leave this for future work. 

Exponential grid. The exponential grid of Har-Peled and Mazumdar fHM04H around a point r 
with length parameter R > roughly works as follows. They build O(logn) axis-parallel squares, 
where the j th square has side length of i?2 J and is partitioned into sub-squares (i.e., a grid) of side 
length 0(eR2 3 /d), the idea being that areas closer to the point r have smaller cell size, while areas 
further away have larger cell size. This construction does not carry over to doubling dimension 
metrics as we cannot define grid points in ambient space. However, we make use of T to provide a 
set with similar properties. 

We provide a sketch of the idea in order to ease presentation, but point out that some of our 
constants can to be refined. Given r, we use a weighted level ancestor query |[FM96| , [RTIiTH to locate 



its ancestor r\ og R € ^i gR m T. A descendants search, using Lemma starting from r\ og R with 
refinement constant e will give us a good resolution for points that are roughly at most distance R 
away from r. Let rj + \ og R £ Yj+iogR be an ancestor of r in T, for < j < O(logn). We perform a 



descendants search using Lemma 2.5 starting from each such rj + \ OS R, with refinement constant e. 
Notice that for a specific j, for points within distance 2 J+log ^ = 2?R from r, the descendant search 
starting from rj+i gfl reaches a set of points which are in 3^+iog.R— iog(i/e) which is similar to the 
resolution obtained from the exponential grid in Euclidean space. 

The union of all of the points seen during all of the descendants searches on all O(logn) levels 
provides a set of size £ _ °( ddim ) logn, which gives us (i.e., in doubling dimension metrics) the same 
properties as the exponential grid does in a Euclidean space. Thus we obtain a core-set S and 
centroid set D both of size at most e -°( ddim )(p . logn) ^ 1 ). Finally, perform an exhaustive search 
through all subsets of size p of the centroid set D and compute the cost of each such set, which takes 
total time ('J 1 ) • 0(\S\p) < e -°( P -ddim)( p . l og n)°( p ). Notice that a weighted level ancestor query 
is performed for each of the points in the core-set S, which increases the runtime by £-°( ddim ) (^ . 
log n)°^ ■ log log log A. □ 



21 



References 



[AGGM06] I. Abraham, C. Gavoillc, A. V. Goldberg, and D. Malkhi. Routing in networks with low doubling 
dimension. In 26th IEEE International Conference on Distributed Computing Systems, page 75. 
IEEE, 2006. 

[AP02] P. K. Agarwal and C. M. Procopiuc. Exact and approximation algorithms for clustering. Algo- 
rithmica, 33:201-226, 2002. 

[ARR98] S. Arora, P. Raghavan, and S. Rao. Approximation schemes for euclidean fc-medians and related 
problems. In 13th Annual ACM Symposium on the Theory of Computing, pages 106-113, 1998. 

[BF00] M. A. Bender and M. Farach-Colton. The LCA problem revisited. In LATIN 2000: Theoretical 
Informatics, pages 88-94, 2000. 

[BGK12] Y. Bartal, L.-A. Gottlieb, and R. Krauthgamcr. The traveling salesman problem: Low- 
dimensionality implies a polynomial time approximation scheme. In ^th symposium on Theory 
of Computing, pages 663-672. ACM, 2012. 

[BHI02] M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. In 34th Annual 
ACM Symposium on Theory of Computing, pages 250-257, 2002. 

[BKL06] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In 23rd interna- 
tional conference on Machine learning, pages 97-104. ACM, 2006. 

[BLL09] N. H. Bshouty, Y. Li, and P. M. Long. Using the doubling dimension to analyze the generalization 
of learning algorithms. Journal of Computer and System Sciences, 75(6):323 - 335, 2009. 

[CG06] R. Cole and L.-A. Gottlieb. Searching dynamic point sets in spaces with bounded doubling 
dimension. In 38th annual ACM symposium on Theory of computing, pages 574-583. ACM, 
2006. 

[Che06] K. Chen. On fc-median clustering in high dimensions. In 17th Annual ACM-SIAM Symposium 
on Discrete algorithm, pages 1177-1185. ACM, 2006. 

[FM96] M. Farach and S. Muthukrishnan. Perfect hashing for strings: Formalization and algorithms. In 
7th Annual Symposium on Combinatorial Pattern Matching, pages 130-140, 1996. 

[GKK10] L.-A. Gottlieb, L. Kontorovich, and R. Krauthgamcr. Efficient classification for metric data. In 
23rd Conference on Learning Theory, pages 433-440. Omnipress, 2010. 

[GKL03] A. Gupta, R. Krauthgamcr, and J. R. Lee. Bounded geometries, fractals, and low-distortion 
embeddings. In ^th Annual IEEE Symposium on Foundations of Computer Science, pages 
534-543, October 2003. 

[Gon85] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoret. Comput. 
Sci., 38(2-3):293-306, 1985. 

[HM04] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In 36th 
Annual ACM Symposium on Theory of Computing,, pages 291-300, 2004. 

[HM06] S. Har-Peled and M. Mendel. Fast construction of nets in low-dimensional metrics and their 
applications. SI AM Journal on Computing, 35(5):1148-1184, 2006. 

[HT84] D. Hard and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. 
Comput, 13(2):338-355, 1984. 

[Ind99] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Annual 
ACM Symposium on Theory of Computing, pages 428-434. ACM, 1999. 



22 



[KL04a] R. Krauthgamer and J. R. Lee. The black-box complexity of nearest neighbor search. In 
31st International Colloquium on Automata, Languages and Programming, Lecture Notes in 
Computer Science, pages 858-869. Springer, July 2004. 

[KL04b] R. Krauthgamer and J. R. Lee. Navigating nets: Simple algorithms for proximity search. In 
15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 791-801, January 2004. 

[KL07] T. Kopclowitz and M. Lewenstein. Dynamic weighted ancestors. In 18th Annual ACM-SIAM 
Symposium on Discrete Algorithms,, pages 565-574, 2007. 

[KR07] S. G. Kolliopoulos and S. Rao. A nearly linear-time approximation scheme for the euclidean 
k-median problem. SIAM J. Comput., 37(3):757-782, 2007. 

[KRX08] G. Konjevod, A. W. Richa, and D. Xia. Dynamic routing and location services in metrics of low 
doubling dimension. In 22nd International Symposium on Distributed Computing, volume 5218 
of Lecture Notes in Computer Science, pages 379-393. Springer, 2008. 

[KSS10] A. Kumar, Y. Sabharwal, and S. Sen. Linear-time approximation schemes for clustering problems 
in any dimensions. J. ACM, 57(2), 2010. 

[KSW09] J. Klcinbcrg, A. Slivkins, and T. Wcxler. Triangulation and embedding using small sets of 
beacons. J. ACM, 56:32:1-32:37, September 2009. 

[Sli07] A. Slivkins. Distance estimation and object location via rings of neighbors. Distributed Com- 
puting, 19:313-333, 2007. 

[Tal04] K. Talwar. Bypassing the embedding: Algorithms for low dimensional metrics. In Proceedings 
of the 36th Annual ACM Symposium on Theory of Computing, pages 281-290, 2004. 



23 



